10 research outputs found
Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance
Retrieval augmented models show promise in enhancing traditional language
models by improving their contextual understanding, integrating private data,
and reducing hallucination. However, the processing time required for retrieval
augmented large language models poses a challenge when applying them to tasks
that require real-time responses, such as composition assistance.
To overcome this limitation, we propose the Hybrid Retrieval-Augmented
Generation (HybridRAG) framework that leverages a hybrid setting that combines
both client and cloud models. HybridRAG incorporates retrieval-augmented memory
generated asynchronously by a Large Language Model (LLM) in the cloud. By
integrating this retrieval augmented memory, the client model acquires the
capability to generate highly effective responses, benefiting from the LLM's
capabilities. Furthermore, through asynchronous memory integration, the client
model is capable of delivering real-time responses to user requests without the
need to wait for memory synchronization from the cloud. Our experiments on
Wikitext and Pile subsets show that HybridRAG achieves lower latency than a
cloud-based retrieval-augmented LLM, while outperforming client-only models in
utility.
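The core idea of the abstract — a client model that answers immediately from the latest available memory snapshot while a cloud LLM refreshes that memory asynchronously — can be sketched as follows. This is a minimal illustration under assumed names; it is not the paper's actual API, and the two model calls are stand-ins.

```python
import threading
import time

class HybridRAGClient:
    """Sketch of asynchronous memory integration (illustrative, not the
    paper's implementation)."""

    def __init__(self):
        self.memory = ""           # latest retrieval-augmented memory snapshot
        self._lock = threading.Lock()

    def refresh_memory_async(self, query):
        """Ask the cloud LLM for fresh memory without blocking the client."""
        def worker():
            snapshot = self._cloud_llm_memory(query)   # slow cloud round-trip
            with self._lock:
                self.memory = snapshot
        threading.Thread(target=worker, daemon=True).start()

    def complete(self, prefix):
        """Respond immediately with whatever memory is currently available."""
        with self._lock:
            context = self.memory
        return self._client_model(prefix, context)

    # Stand-ins for the real cloud and client models:
    def _cloud_llm_memory(self, query):
        time.sleep(0.05)                               # simulated latency
        return f"facts about {query}"

    def _client_model(self, prefix, context):
        return f"{prefix} [using memory: {context!r}]"

client = HybridRAGClient()
client.refresh_memory_async("wikitext topic")
fast = client.complete("The article says")   # returns without waiting
time.sleep(0.1)                              # cloud memory has now arrived
slow = client.complete("The article says")
```

The first completion returns before the cloud round-trip finishes (with an empty memory), which is exactly the latency trade-off the framework exploits: responses never block on memory synchronization.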
PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis
Major cloud providers have employed advanced AI-based solutions like large
language models to aid humans in identifying the root causes of cloud
incidents. Despite the growing prevalence of AI-driven assistants in the root
cause analysis process, their effectiveness in assisting on-call engineers is
constrained by low accuracy due to the intrinsic difficulty of the task, a
propensity for LLM-based approaches to hallucinate, and difficulties in
distinguishing these well-disguised hallucinations. To address this challenge,
we propose to perform confidence estimation for the predictions to help on-call
engineers make decisions on whether to adopt the model prediction. Considering
the black-box nature of many LLM-based root cause predictors, fine-tuning or
temperature-scaling-based approaches are inapplicable. We therefore design an
innovative confidence estimation framework based on prompting
retrieval-augmented large language models (LLMs) that demand a minimal amount
of information from the root cause predictor. This approach consists of two
scoring phases: the LLM-based confidence estimator first evaluates its
confidence in making judgments in the face of the current incident that
reflects its ``grounded-ness'' level in reference data, then rates the root
cause prediction based on historical references. An optimization step combines
these two scores for a final confidence assignment. We show that our method is
able to produce calibrated confidence estimates for predicted root causes,
validate the usefulness of retrieved historical data and the prompting strategy
as well as the generalizability across different root cause prediction models.
Our study takes an important step toward reliably and effectively embedding
LLMs into cloud incident management systems.
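The two-phase scoring described above — a groundedness score and a historical-reference rating, combined by an optimization step — could be sketched as below. The scoring inputs and the grid-search weighting are illustrative assumptions, not the paper's exact optimization.

```python
# Combine two prompting-based scores into one calibrated confidence.
# The weighting scheme here (convex combination fit by grid search on a
# held-out set) is an assumed stand-in for the paper's optimization step.

def combine(grounded, rating, w):
    """Convex combination of groundedness score and historical rating."""
    return w * grounded + (1 - w) * rating

def fit_weight(examples):
    """Grid-search w in [0, 1] to minimize the Brier score against labels."""
    best_w, best_brier = 0.0, float("inf")
    for i in range(101):
        w = i / 100
        brier = sum((combine(g, r, w) - y) ** 2
                    for g, r, y in examples) / len(examples)
        if brier < best_brier:
            best_w, best_brier = w, brier
    return best_w

# Toy held-out set: (groundedness, historical rating, was the RCA correct?)
held_out = [
    (0.9, 0.8, 1), (0.8, 0.9, 1), (0.3, 0.4, 0),
    (0.2, 0.1, 0), (0.7, 0.6, 1), (0.4, 0.2, 0),
]
w = fit_weight(held_out)
confidence = combine(0.85, 0.75, w)
```

Because the combination only needs the two scores, nothing is required from the (black-box) root cause predictor beyond its prediction, which matches the abstract's "minimal amount of information" claim.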
Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach
This paper investigates a critical resource allocation problem in the
first-party cloud: scheduling containers to machines. There are tens of services, and
each service runs a set of homogeneous containers with dynamic resource usage;
containers of a service are scheduled daily in a batch fashion. This problem
can be naturally formulated as Stochastic Bin Packing Problem (SBPP). However,
traditional SBPP research often focuses on cases of empty machines, whose
objective, i.e., to minimize the number of used machines, is not well-defined
for the more common reality with nonempty machines. This paper aims to close
this gap. First, we define a new objective metric, Used Capacity at Confidence
(UCaC), which measures the maximum used resources at a probability and is
proved to be consistent for both empty and nonempty machines, and reformulate
the SBPP under chance constraints. Second, by modeling the container resource
usage distribution in a generative approach, we reveal that UCaC can be
approximated with Gaussian, which is verified by trace data of real-world
applications. Third, we propose an exact solver by solving the equivalent
cutting stock variant as well as two heuristics-based solvers -- UCaC best fit,
bi-level heuristics. We experimentally evaluate these solvers on both synthetic
datasets and real application traces, demonstrating our methodology's advantage
over traditional SBPP optimal solver minimizing the number of used machines,
with a low rate of resource violations.
Comment: To appear in SIGKDD 2022 as a Research Track paper.
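Under the Gaussian approximation the abstract mentions, UCaC reduces to mu + z_alpha * sigma, and the "UCaC best fit" heuristic can be sketched as placing each container on the machine whose UCaC grows least while staying under capacity. The confidence level, independence assumption, and function names below are illustrative, not the paper's exact formulation.

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.99)   # assumed confidence level alpha = 0.99

def ucac(mu, var):
    """Used Capacity at Confidence under the Gaussian approximation:
    UCaC = mu + z_alpha * sqrt(var)."""
    return mu + Z * var ** 0.5

def ucac_best_fit(machines, container, capacity):
    """Place a container (mean, variance) on the machine whose UCaC
    increases least while staying under capacity; open a new machine
    otherwise. Independent container usage is assumed (variances add)."""
    c_mu, c_var = container
    best, best_inc = None, float("inf")
    for i, (mu, var) in enumerate(machines):
        new = ucac(mu + c_mu, var + c_var)
        if new <= capacity and new - ucac(mu, var) < best_inc:
            best, best_inc = i, new - ucac(mu, var)
    if best is None:
        machines.append((c_mu, c_var))
        return len(machines) - 1
    mu, var = machines[best]
    machines[best] = (mu + c_mu, var + c_var)
    return best

machines = [(50.0, 25.0), (20.0, 4.0)]   # nonempty machines are handled too
idx = ucac_best_fit(machines, (10.0, 9.0), capacity=100.0)
```

Note that the metric is well-defined for nonempty machines (the loop simply starts from their current mean and variance), which is the gap in prior SBPP objectives that the paper highlights.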
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
Recent advancements in Large Language Models (LLMs) have revolutionized
decision-making by breaking down complex problems into more manageable language
sequences referred to as ``thoughts''. An effective thought design should
consider three key perspectives: performance, efficiency, and flexibility.
However, existing thought paradigms can exhibit at most two of these attributes. To
address these limitations, we introduce a novel thought prompting approach
called ``Everything of Thoughts'' (XoT) to defy the law of the ``Penrose
triangle'' of existing thought paradigms. XoT leverages pretrained reinforcement learning
and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge
into thoughts, thereby enhancing LLMs' capabilities and enabling them to
generalize to unseen problems efficiently. Through the utilization of the
MCTS-LLM collaborative thought revision framework, this approach autonomously
produces high-quality comprehensive cognitive mappings with minimal LLM
interactions. Additionally, XoT empowers LLMs to engage in unconstrained
thinking, allowing for flexible cognitive mappings for problems with multiple
solutions. We evaluate XoT on several challenging multi-solution
problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube. Our
results demonstrate that XoT significantly outperforms existing approaches.
Notably, XoT can yield multiple solutions with just one LLM call, showcasing
its remarkable proficiency in addressing complex problems across diverse
domains.
Comment: 17 pages, 5 figures.
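One step of the MCTS machinery the abstract relies on is selecting the next "thought" to expand; a standard prior-weighted UCT rule is sketched below. The prior field stands in for XoT's pretrained policy network, and all names and values are illustrative, not the paper's implementation.

```python
import math

def uct_select(children, c=1.4):
    """Pick the child thought with the highest UCT score. The 'prior'
    term is a stand-in for a pretrained policy network's output."""
    total = sum(ch["visits"] for ch in children) or 1
    def score(ch):
        exploit = ch["value"] / ch["visits"] if ch["visits"] else 0.0
        explore = c * ch["prior"] * math.sqrt(total) / (1 + ch["visits"])
        return exploit + explore
    return max(children, key=score)

# Toy candidate thoughts for one MCTS step in, e.g., Game of 24:
children = [
    {"thought": "8*3=24", "visits": 5, "value": 4.0, "prior": 0.5},
    {"thought": "6+6=12", "visits": 1, "value": 0.2, "prior": 0.3},
    {"thought": "4-4=0",  "visits": 0, "value": 0.0, "prior": 0.2},
]
best = uct_select(children)
```

The exploit term favors thoughts that have paid off so far, while the prior-weighted explore term injects the external domain knowledge, which is how the search can produce complete thought trajectories with minimal LLM interaction.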
ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection
Anomaly detection in multivariate time series data is of paramount importance
for ensuring the efficient operation of large-scale systems across diverse
domains. However, accurately detecting anomalies in such data poses significant
challenges. Existing approaches, including forecasting and reconstruction-based
methods, struggle to address these challenges effectively. To overcome these
limitations, we propose a novel anomaly detection framework named ImDiffusion,
which combines time series imputation and diffusion models to achieve accurate
and robust anomaly detection. The imputation-based approach employed by
ImDiffusion leverages the information from neighboring values in the time
series, enabling precise modeling of temporal and inter-correlated
dependencies, reducing uncertainty in the data, thereby enhancing the
robustness of the anomaly detection process. ImDiffusion further leverages
diffusion models as time series imputers to accurately capture complex
dependencies. We leverage the step-by-step denoised outputs generated during
the inference process to serve as valuable signals for anomaly prediction,
resulting in improved accuracy and robustness of the detection process.
We evaluate the performance of ImDiffusion via extensive experiments on
benchmark datasets. The results demonstrate that our proposed framework
significantly outperforms state-of-the-art approaches in terms of detection
accuracy and timeliness. ImDiffusion has further been integrated into a real
production system at Microsoft, where we observe a remarkable 11.4% increase in
detection F1 score compared to the legacy approach. To the best of our
knowledge, ImDiffusion represents a pioneering approach that combines
imputation-based techniques with time series anomaly detection, while
introducing the novel use of diffusion models to the field.
Comment: To appear in VLDB 2024. Code: https://github.com/17000cyh/IMDiffusion.gi
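The imputation-based scoring idea — mask a point, reconstruct it from its neighbors, and treat a large reconstruction error as evidence of an anomaly — can be sketched with a trivial neighbor-mean imputer standing in for the diffusion model. Everything below is an illustration of the scoring logic only, not ImDiffusion's model.

```python
import statistics

def impute(series, idx, window=2):
    """Stand-in imputer: mean of neighboring observed values. ImDiffusion
    uses a diffusion model here; this only illustrates the scoring logic."""
    neighbors = [series[j]
                 for j in range(max(0, idx - window),
                                min(len(series), idx + window + 1))
                 if j != idx]
    return statistics.mean(neighbors)

def anomaly_scores(series):
    """Mask each point, impute it from its neighbors, and score by the
    absolute imputation error; large errors suggest anomalies."""
    return [abs(series[i] - impute(series, i)) for i in range(len(series))]

series = [1.0, 1.1, 0.9, 9.0, 1.0, 1.2]   # point anomaly at index 3
scores = anomaly_scores(series)
flagged = max(range(len(scores)), key=scores.__getitem__)
```

Because the imputer conditions on surrounding values, the score directly exploits temporal dependencies, which is the robustness argument the abstract makes for imputation over forecasting or reconstruction.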
Learning Cooperative Oversubscription for Cloud by Chance-Constrained Multi-Agent Reinforcement Learning
Oversubscription is a common practice for improving cloud resource
utilization. It allows the cloud service provider to sell more resources than
the physical limit, assuming not all users would fully utilize the resources
simultaneously. However, how to design an oversubscription policy that improves
utilization while satisfying some safety constraints remains an open
problem. Existing methods and industrial practices are over-conservative,
ignoring the coordination of diverse resource usage patterns and probabilistic
constraints. To address these two limitations, this paper formulates cloud
oversubscription as a chance-constrained optimization problem and
proposes an effective Chance-Constrained Multi-Agent Reinforcement Learning
(C2MARL) method to solve this problem. Specifically, C2MARL reduces the number
of constraints by considering their upper bounds and leverages a multi-agent
reinforcement learning paradigm to learn a safe and optimal coordination
policy. We evaluate our C2MARL on an internal cloud platform and public cloud
datasets. Experiments show that our C2MARL outperforms existing methods in
improving utilization under different levels of safety
constraints.
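The chance constraint at the heart of the formulation — keep the probability of total usage exceeding physical capacity below a small epsilon — can be illustrated with an independent-Gaussian usage model. This is a simplification for intuition: C2MARL works with learned multi-agent policies and constraint upper bounds, not this closed form.

```python
from statistics import NormalDist

def oversubscription_safe(mus, sigmas, capacity, epsilon=0.01):
    """Check the chance constraint P(sum of usages > capacity) <= epsilon
    under an assumed independent-Gaussian usage model (illustrative only)."""
    mu = sum(mus)
    sigma = sum(s ** 2 for s in sigmas) ** 0.5
    violation_prob = 1 - NormalDist(mu, sigma).cdf(capacity)
    return violation_prob <= epsilon

# Mean usages sum to 75 against a physical capacity of 100, so selling
# more than the means (oversubscribing) is still probabilistically safe.
safe = oversubscription_safe(mus=[30, 25, 20], sigmas=[5, 4, 3], capacity=100)
```

The same check with a single high-variance tenant near the limit fails, which is why a policy must coordinate diverse usage patterns rather than bound each tenant conservatively in isolation.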
Diffusion-based Time Series Data Imputation for Microsoft 365
Reliability is extremely important for large-scale cloud systems like
Microsoft 365. Cloud failures such as disk failure, node failure, etc. threaten
service reliability, resulting in online service interruptions and economic
loss. Existing works focus on predicting cloud failures and proactively taking
action before failures happen. However, they suffer from poor data quality like
data missing in model training and prediction, which limits the performance. In
this paper, we focus on enhancing data quality through data imputation by the
proposed Diffusion+, a sample-efficient diffusion model, to impute the missing
data efficiently based on the observed data. Our experiments and application
practice show that our model contributes to improving the performance of the
downstream failure prediction task.
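The role imputation plays in this pipeline — fill gaps in telemetry before the failure predictor ever sees the data — can be illustrated with a trivial interpolation stand-in. Diffusion+ would replace the imputer below with a sample-efficient diffusion model; the function and data are illustrative.

```python
def interpolate_missing(series):
    """Stand-in imputer (linear interpolation of interior gaps); Diffusion+
    would replace this with a sample-efficient diffusion model."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            left = next(out[j] for j in range(i - 1, -1, -1)
                        if out[j] is not None)
            right = next(out[j] for j in range(i + 1, len(out))
                         if out[j] is not None)
            out[i] = (left + right) / 2
    return out

# Disk telemetry with gaps; impute before feeding the failure predictor.
raw = [2, None, 4, 5, None, 7]
clean = interpolate_missing(raw)
```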
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the
reliability of microservice systems. However, performing RCA on modern
microservice systems can be challenging due to their large scale, as they
usually comprise hundreds of components, leading to significant human effort. This
paper proposes TraceDiag, an end-to-end RCA framework that addresses the
challenges for large-scale microservice systems. It leverages reinforcement
learning to learn a pruning policy for the service dependency graph that
automatically eliminates redundant components, thereby significantly improving
the RCA efficiency. The learned pruning policy is interpretable and fully
adaptive to new RCA instances. With the pruned graph, a causal-based method can
be executed with high accuracy and efficiency. The proposed TraceDiag framework
is evaluated on real data traces collected from the Microsoft Exchange system,
and demonstrates superior performance compared to state-of-the-art RCA
approaches. Notably, TraceDiag has been integrated as a critical component in
the Microsoft M365 Exchange, resulting in a significant improvement in the
system's reliability and a considerable reduction in the human effort required
for RCA.
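The pruning step described above — keep only the dependency edges the learned policy considers relevant, then run causal RCA on the reduced graph — can be sketched as a threshold filter. The scores, edge names, and threshold are illustrative stand-ins for TraceDiag's RL-learned policy.

```python
def prune_dependency_graph(edges, keep_score, threshold=0.5):
    """Drop service-dependency edges whose learned relevance score falls
    below a threshold; the scores stand in for TraceDiag's RL pruning
    policy (names and threshold are assumptions)."""
    return [e for e in edges if keep_score.get(e, 0.0) >= threshold]

# Toy dependency graph for an incident observed at a frontend service:
edges = [("frontend", "auth"), ("frontend", "cache"), ("cache", "disk")]
scores = {("frontend", "auth"): 0.9,
          ("frontend", "cache"): 0.2,
          ("cache", "disk"): 0.1}
pruned = prune_dependency_graph(edges, scores)
```

With hundreds of components reduced to a handful of plausible edges, the subsequent causal analysis has far fewer candidates to test, which is where the efficiency gain comes from.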
Assess and Summarize: Improve Outage Understanding with Large Language Models
Cloud systems have become increasingly popular in recent years due to their
flexibility and scalability. Each time cloud computing applications and
services hosted on the cloud are affected by a cloud outage, users can
experience slow response times, connection issues or total service disruption,
resulting in a significant negative business impact. Outages usually
comprise several concurrent events/source causes, and therefore
understanding the context of outages is a very challenging yet crucial first
step toward mitigating and resolving outages. In current practice, on-call
engineers with in-depth domain knowledge have to manually assess and summarize
outages when they happen, which is time-consuming and labor-intensive. In this
paper, we first present a large-scale empirical study investigating the way
on-call engineers currently deal with cloud outages at Microsoft, and then
present and empirically validate a novel approach (dubbed Oasis) to help the
engineers in this task. Oasis is able to automatically assess the impact scope
of outages as well as to produce human-readable summarization. Specifically,
Oasis first assesses the impact scope of an outage by aggregating relevant
incidents via multiple techniques. Then, it generates a human-readable summary
by leveraging fine-tuned large language models like GPT-3.x. The impact
assessment component of Oasis was introduced in Microsoft over three years ago,
and it is now widely adopted, while the outage summarization component has been
recently introduced, and in this article we present the results of an empirical
evaluation we carried out on 18 real-world cloud systems as well as a
human-based evaluation with outage owners. The results show that Oasis can
effectively and efficiently summarize outages, and led Microsoft to deploy its
first prototype, which is currently under experimental adoption by some of the
incident teams.
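The second stage described above — aggregate the related incidents, then hand a fine-tuned LLM a summarization prompt — can be sketched as prompt construction. The fields and prompt wording are assumptions for illustration, not Oasis's actual template, and the LLM call itself is omitted.

```python
def build_outage_summary_prompt(incidents):
    """Aggregate related incidents into a single summarization prompt for
    a fine-tuned LLM (wording and fields are illustrative assumptions)."""
    lines = [f"- [{i['service']}] {i['title']}" for i in incidents]
    return ("Summarize the following related cloud incidents as one outage "
            "report:\n" + "\n".join(lines))

# Incidents grouped by the impact-assessment component:
incidents = [
    {"service": "Mail", "title": "Elevated send latency"},
    {"service": "Mail", "title": "SMTP connection timeouts"},
]
prompt = build_outage_summary_prompt(incidents)
```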
Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
Ensuring the reliability and availability of cloud services necessitates
efficient root cause analysis (RCA) for cloud incidents. Traditional RCA
methods, which rely on manual investigations of data sources such as logs and
traces, are often laborious, error-prone, and challenging for on-call
engineers. In this paper, we introduce RCACopilot, an innovative on-call system
empowered by a large language model for automating RCA of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based
on their alert types, aggregates the critical runtime diagnostic information,
predicts the incident's root cause category, and provides an explanatory
narrative. We evaluate RCACopilot using a real-world dataset consisting of a
year's worth of incidents from Microsoft. Our evaluation demonstrates that
RCACopilot achieves an RCA accuracy of up to 0.766. Furthermore, the diagnostic
information collection component of RCACopilot has been in successful use at
Microsoft for over four years.
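The workflow the abstract outlines — match an incoming incident to a handler by alert type, collect diagnostics, then predict a root cause category — can be sketched as a dispatch table. The handler names, fields, and keyword-based predictor below are illustrative stand-ins; RCACopilot's predictor is LLM-backed.

```python
def predict_root_cause(diagnostics):
    """Trivial keyword stand-in for RCACopilot's LLM-based predictor."""
    return "disk-failure" if "disk" in diagnostics else "unknown"

def route_incident(incident, handlers):
    """Match an incoming incident to a handler by alert type, let the
    handler collect diagnostics, then predict the root cause category.
    All names here are illustrative assumptions."""
    handler = handlers.get(incident["alert_type"], handlers["default"])
    diagnostics = handler(incident)
    category = predict_root_cause(diagnostics)
    return {"diagnostics": diagnostics, "category": category}

handlers = {
    "storage": lambda inc: f"disk SMART logs for {inc['id']}",
    "default": lambda inc: f"generic telemetry for {inc['id']}",
}
result = route_incident({"alert_type": "storage", "id": "INC-1"}, handlers)
```

Separating the per-alert-type diagnostic collection from the prediction step is what lets the collection component run in production independently, as the abstract notes it has for years.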